RFC: kernel compile separation + mixed-TP + MX client adoption (post-#2389)#2652

Draft
KavinKrishnan wants to merge 5 commits into
PrimeIntellect-ai:nixl_mxfrom
KavinKrishnan:kavink/post-2389-kernel-compile-plan

@KavinKrishnan KavinKrishnan commented May 27, 2026

Summary

This is a doc-only RFC layered on top of #2389. It proposes the next phase of work once #2389 merges to main:

  1. Phase 1 — six surgical fixes against the merged nixl_mx code (close the bug classes we hit during GB200 bring-up: cross-subnet add_remote_agent full-mesh, stale READY peer dedup, heartbeat / STALE-on-shutdown, hardcoded 1200 s timeout, non-MLA model guard for update_mla_absorbed_weights, HSDP barrier ordering).
  2. Phase 2 — graduate src/prime_rl/transport/mx_rendezvous.py (~185 LOC of in-tree rendezvous code) onto NVIDIA's published modelexpress Python clients (MxV2TrainingPublisher / MxV2RefitReceiver). Inherits heartbeat + freshest-per-rank dedup + retention + the v2 sidecar filter (modelexpress PR #295) for free. The in-tree NixlAgentWrapper, Slot, TransportPlan, and classic_cuda_pool stay untouched — that's prime-rl-specific data-plane specialization.
  3. Phase 3 — fixes the trainer-side kernel-compile pinning issue surfaced during the FP8 cast-pipeline iteration on this branch. Trainer publishes HF-raw bytes (kernel-agnostic) over NIXL; inference compiles into its target layout (DeepGemm, cutlass, …) via a receiver-side scratch-buffer pass. Extends the v2 shape registry with a compile_target + compile_metadata field so receivers filter on compatibility. Heterogeneous fleets (DeepGemm and cutlass on the same training run) now work without trainer-side branching.
  4. Phase 4 — generalizes the v2 sharding metadata to handle mixed-TP / mixed-EP via TargetTPLayout + multi-source slice discovery. Same machinery as our NemoRL v2 MoE expert filtering, generalized to dense matmul axes.
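
The Phase 3 receiver-side filter can be pictured with a minimal sketch. Only the `compile_target` and `compile_metadata` field names come from the plan; `SourceEntry`, `compatible_sources`, the `rank` field, and the example metadata are hypothetical stand-ins, not the modelexpress API.

```python
from dataclasses import dataclass, field


@dataclass
class SourceEntry:
    """Hypothetical registry record for one published weight source."""
    rank: int
    compile_target: str  # e.g. "hf_raw", "deepgemm", "cutlass"
    compile_metadata: dict = field(default_factory=dict)


def compatible_sources(entries, accepted_targets):
    """Keep only the sources whose published layout this receiver can consume."""
    return [e for e in entries if e.compile_target in accepted_targets]


entries = [
    SourceEntry(rank=0, compile_target="hf_raw"),
    # The metadata key below is illustrative only.
    SourceEntry(rank=1, compile_target="deepgemm", compile_metadata={"block": 128}),
]

# A receiver that compiles locally only needs kernel-agnostic HF-raw bytes.
usable = compatible_sources(entries, accepted_targets={"hf_raw"})
```

Under this assumption, a DeepGemm receiver and a cutlass receiver would both accept the same `hf_raw` source, which is why the trainer needs no per-kernel branching.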

The plan draws heavily on the NemoRL × Dynamo path (NVIDIA, @jthomson04), which is already running cross-node at 380 Gbps on GB300 RoCE for an 8.82 GB / 399-tensor refit on Qwen3-4B-Thinking-2507. It uses the same scratch-buffer + worker_extension_cls pattern this plan adopts.

What's in this PR

Doc only. 517 lines at docs/proposals/post-pr2389-kernel-compile-plan.md. Includes component + per-refit sequence diagrams (mermaid). No code changes; implementation phases sequence behind this RFC's acceptance.

Why a draft RFC against nixl_mx

The plan only makes sense in the context of this branch's code. Targeting main now would leave it dangling, since main has no nixl_mx to build on. Plan: re-target to main once #2389 merges, then land Phase 1 quickly as a follow-up PR.

Estimated impact

| Phase | Net LOC |
|---|---|
| 1 — surgical fixes | ~100 (in-tree) |
| 2 — client graduation | −400 (mx_rendezvous.py deleted) + 150 (import-and-call) |
| 3 — compile-target registry + receiver-side compile passes | ~+45 modelexpress, ~+350 prime-rl |
| 4 — mixed-TP / mixed-EP slice discovery | ~+200 across both repos |

Total ~450 LOC additive for Phases 3-4, plus a ~400 LOC reduction in maintenance burden from Phase 2.

Test plan

N/A — doc only. Each implementation phase ships its own test plan in the doc (see §8). Phase 3 validation piggybacks on the existing NemoRL+Dynamo GB300 cluster to de-risk the compile-pass design before porting into the prime-rl worker.


Note

Low Risk
Documentation only; no production code, config, or transport behavior changes in this PR.

Overview
Adds docs/proposals/post-pr2389-kernel-compile-plan.md, a doc-only RFC (~517 lines) for work after PR #2389 lands. It does not change runtime code.

The proposal keeps the existing nixl_mx data plane (Slot, TransportPlan, NixlAgentWrapper, pools) and plans rendezvous/metadata extensions only:

  • Phase 1: Six targeted fixes (same-rank remote agents, freshest-per-rank dedup, heartbeat/STALE, configurable timeouts, MLA guard, HSDP barrier order).
  • Phase 2: Replace in-tree MxRendezvous with ModelExpress MxV2TrainingPublisher / MxV2RefitReceiver; adopt worker_extension_cls on the vLLM worker.
  • Phase 3: Move kernel layout compile to inference via scratch buffers + pluggable CompilePass (hf_raw, DeepGemm, cutlass); extend the v2 registry with compile_target / compile_metadata and compile_target_filter on discovery.
  • Phase 4: Mixed TP/EP via TargetTPLayout, slice-aware receive_weights_scratch, and multi-source discover_v2_sources_for_slice.
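
A pluggable compile pass over a scratch buffer could look roughly like the sketch below. The registry, the decorator, and the `toy_blocked` pass are hypothetical illustrations; the real DeepGemm and cutlass passes would re-layout tensors, which is not modeled here.

```python
from typing import Callable, Dict

# Hypothetical registry: compile target name -> function that converts the raw
# bytes landed in a scratch buffer into the layout the local kernel expects.
COMPILE_PASSES: Dict[str, Callable[[bytes], bytes]] = {}


def register_pass(name: str):
    """Decorator that registers a compile pass under a target name."""
    def decorator(fn):
        COMPILE_PASSES[name] = fn
        return fn
    return decorator


@register_pass("hf_raw")
def identity_pass(raw: bytes) -> bytes:
    # HF-raw bytes are consumed as published, so no re-layout is needed.
    return raw


@register_pass("toy_blocked")
def toy_blocked_pass(raw: bytes, block: int = 4) -> bytes:
    # Stand-in for a kernel-specific layout: zero-pad to a multiple of `block`.
    pad = (-len(raw)) % block
    return raw + b"\x00" * pad


def refit_tensor(scratch: bytearray, target: str) -> bytes:
    """Run the pass selected by the receiver's own compile target."""
    return COMPILE_PASSES[target](bytes(scratch))
```

The point of the registry shape is that the trainer never appears in it: each inference worker picks its pass from its own target, after the bytes have arrived.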

The doc includes mermaid architecture/sequence diagrams, phased LOC estimates, open questions, and links to ModelExpress/NemoRL validation paths (e.g. scratch refit on GB300).

Reviewed by Cursor Bugbot for commit 7feee0d.

…le, mixed-TP, MX clients

Proposes the next phase of work on top of `nixl_mx` once PrimeIntellect-ai#2389 merges:

1. Phase-1 — six surgical fixes against the in-tree code that close
   the bug classes we hit during GB200 bring-up (cross-subnet
   add_remote_agent full-mesh; stale READY peer dedup; heartbeat /
   STALE-on-shutdown; hardcoded 1200s timeout; non-MLA model guard;
   HSDP barrier ordering). Line-pinned against HEAD `79ea824d8`.

2. Phase-2 — graduate `src/prime_rl/transport/mx_rendezvous.py` onto
   NVIDIA's published `modelexpress` Python clients
   (`MxV2TrainingPublisher` / `MxV2RefitReceiver`). Deletes ~185 LOC
   of in-tree rendezvous that duplicates the upstream client.
   Inherits heartbeat + freshest-per-rank dedup + retention +
   sidecar-filter for free. `NixlAgentWrapper` / `Slot` /
   `TransportPlan` / `classic_cuda_pool` stay — those are prime-rl
   specialization.

3. Phase-3 — solves the trainer-side kernel-compile issue surfaced
   during PrimeIntellect-ai#2389's FP8 cast-pipeline iteration. Trainer publishes
   HF-raw bytes (kernel-agnostic); inference compiles into its
   target layout (DeepGemm, cutlass, ...) via a receiver-side
   scratch-buffer pass. Extends the v2 shape registry with
   `compile_target` + `compile_metadata`. Heterogeneous fleets
   (mixed kernels on the same training run) now work without
   trainer-side branching.

4. Phase-4 — generalizes the v2 sharding metadata to handle
   mixed-TP/EP via `TargetTPLayout` + multi-source slice discovery,
   using the same machinery NemoRL v2 uses for MoE expert filtering.
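
Multi-source slice discovery for mixed TP can be sketched as interval
intersection along the sharded axis. This is a minimal illustration that
assumes evenly divisible shards along one axis; `shard_range` and
`sources_for_slice` are hypothetical names, not the `TargetTPLayout` API.

```python
def shard_range(dim_size, tp_size, tp_rank):
    """Half-open [start, stop) owned by one rank; assumes an even split."""
    if dim_size % tp_size:
        raise ValueError("sketch assumes dim_size is divisible by tp_size")
    width = dim_size // tp_size
    return tp_rank * width, (tp_rank + 1) * width


def sources_for_slice(dim_size, src_tp, dst_tp, dst_rank):
    """Plan the copies that cover one target shard.

    Returns (src_rank, src_offset, dst_offset, length) tuples, where offsets
    are relative to the start of the source and target shards respectively.
    """
    dst_start, dst_stop = shard_range(dim_size, dst_tp, dst_rank)
    plan = []
    for src_rank in range(src_tp):
        src_start, src_stop = shard_range(dim_size, src_tp, src_rank)
        lo, hi = max(src_start, dst_start), min(src_stop, dst_stop)
        if lo < hi:  # the two shards overlap
            plan.append((src_rank, lo - src_start, lo - dst_start, hi - lo))
    return plan


# Trainer at TP=4, inference at TP=2: target rank 1 pulls from two sources.
plan = sources_for_slice(dim_size=4096, src_tp=4, dst_tp=2, dst_rank=1)
```

In the opposite direction (trainer TP smaller than inference TP) the same
function returns a single partial read from one source shard.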

Pulls heavily on the NemoRL × Dynamo path (NVIDIA, John Thompson)
which is already running at 380 Gbps on GB300 RoCE for an 8.82 GB
refit — same scratch-buffer + worker-extension-cls pattern this
plan adopts.

Component + per-refit sequence diagrams (mermaid) included.
Estimated ~450 LOC additive across modelexpress + prime-rl for
Phases 3-4 (plus the ~400 LOC subtraction from Phase 2).

Doc only. Implementation phases sequenced behind the upstream
merge of PrimeIntellect-ai#2389.
@KavinKrishnan KavinKrishnan marked this pull request as draft May 27, 2026 15:20
… (v0.7.x)

Captures the empirical findings from baking PRs #1 and #2 into an ARM64
GB200 image and running it on the kavin namespace for 8+ hours on
Qwen3-30B-A3B-Instruct-2507 with gsm8k.

Documents three real surprises the unit tests didn't cover:

1. Dockerfile.cuda's `uv sync` is missing `--extra disagg`, so modelexpress
   isn't installed in stock images; inference workers crash at the first
   import. Shipped v0.7.1 as a one-line overlay that adds the extra until
   the upstream Dockerfile.cuda can be updated.

2. `LD_PRELOAD` path for libcudart.so.12 — v0.5.2 had /usr/local/cuda
   present in the final stage; v0.7.0 (built from upstream Dockerfile.cuda
   as-is) doesn't. The pip-installed wheel path
   (/app/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib/) is
   the new canonical location.

3. The configmap monkeypatch (patch_nixl_mx.py) and Phase 2's source-baked
   fixes are complementary — they patch different layers (broadcast vs
   rendezvous-wait) and both should stay until PR #1 merges upstream.
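
The first two surprises could be caught by a small preflight check at
container start. This is a sketch only: the wheel directory is the one
quoted above, while the `lib64` subdirectory under /usr/local/cuda and
both function names are assumptions.

```python
import importlib.util
import os

CANDIDATE_DIRS = [
    # pip-installed wheel location quoted in the notes above
    "/app/.venv/lib/python3.12/site-packages/nvidia/cuda_runtime/lib",
    # assumed toolkit layout for images that still ship /usr/local/cuda
    "/usr/local/cuda/lib64",
]


def find_libcudart(dirs=CANDIDATE_DIRS, name="libcudart.so.12"):
    """Return the first existing path to the CUDA runtime, or None."""
    for directory in dirs:
        path = os.path.join(directory, name)
        if os.path.exists(path):
            return path
    return None


def has_modelexpress():
    """True if the modelexpress package is importable in this environment."""
    return importlib.util.find_spec("modelexpress") is not None
```

Running both checks before launching the inference worker would turn a
crash at first import into an explicit startup error.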

Build experience numbers:
  - v0.7.0 from-scratch ARM64 build under QEMU: 6h45min (uv sync 45m,
    flash-attn from source 3h45m).
  - v0.7.1 overlay on top of v0.7.0: ~3 min.

Cluster observations from v0.5.2 + configmap monkeypatch (the
runtime-patched path our PR #1 codifies into source):
  - 183 successful RL refit cycles in one 66-min uninterrupted window
  - Reward variance 0.5-1.0 across orchestrator steps (real learning)
  - Off-policy level = 0 throughout
  - Zero NIXL data-plane errors
  - Recurring orchestrator wait_for_all_peers_ready timeout (~once per
    30-66 min) is the exact bug class Phase 2's rendezvous-level dedup
    eliminates
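
A rough sketch of freshest-per-rank dedup with a heartbeat cutoff, to show
why a leftover READY record from a restarted worker stops blocking the
wait. `PeerRecord`, its fields, and both functions are hypothetical; the
upstream client's actual record shape is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class PeerRecord:
    rank: int
    agent_id: str
    status: str            # "READY" or "STALE"
    last_heartbeat: float  # seconds on the same clock as `now`


def freshest_ready_per_rank(records, now, heartbeat_ttl_s):
    """Pick at most one live READY record per rank, preferring the newest."""
    best = {}
    for rec in records:
        if rec.status != "READY":
            continue  # explicitly marked STALE, e.g. on shutdown
        if now - rec.last_heartbeat > heartbeat_ttl_s:
            continue  # READY but silent for too long: treat as dead
        current = best.get(rec.rank)
        if current is None or rec.last_heartbeat > current.last_heartbeat:
            best[rec.rank] = rec
    return best


def all_peers_ready(records, world_size, now, heartbeat_ttl_s):
    """True once every rank has exactly one live READY record."""
    return len(freshest_ready_per_rank(records, now, heartbeat_ttl_s)) == world_size


records = [
    PeerRecord(0, "a-old", "READY", last_heartbeat=10.0),  # pre-restart leftover
    PeerRecord(0, "a-new", "READY", last_heartbeat=95.0),
    PeerRecord(1, "b", "READY", last_heartbeat=98.0),
]
live = freshest_ready_per_rank(records, now=100.0, heartbeat_ttl_s=30.0)
```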

Also notes seven RFC updates queued in
pensieve/RL/PrimeRL/09_rfc_updates_needed.md, three of which are new from
this build experience (disagg extra, LD_PRELOAD path, vLLM PR #43375 /
Anyscale RDT positioning).

Companion to the RFC at docs/proposals/post-pr2389-kernel-compile-plan.md.
…/3/4 upstream form

vLLM published https://vllm.ai/blog/2026-05-28-native-rl-apis the same day,
announcing a standardized WeightTransferEngine abstract base + 4-phase
lifecycle (init / start / update / finish) + a pluggable
WeightTransferEngineFactory.register_engine(...) extension point.

This is the upstream integration seam that the in-tree MxRendezvous
reimplementation in PR PrimeIntellect-ai#2389 and the worker_extension_cls injection in
inference/vllm/worker/nixl_mx.py have been emulating. The cleanest form
of all our Phase 2/3/4 work upstream is a single MxWeightTransferEngine
adapter (~150-200 LOC) that subclasses WeightTransferEngine and wraps the
existing MxV2RefitReceiver + MxV2TrainingPublisher.
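
The shape of such an adapter might look like the sketch below. Everything
here is a local stand-in: the base class only mirrors the four phase names
from the blog post, the real vLLM signatures are not reproduced, and
`FakeReceiver.receive` is a hypothetical placeholder for whatever
MxV2RefitReceiver actually exposes. Factory registration is omitted for
the same reason.

```python
from abc import ABC, abstractmethod


class WeightTransferEngine(ABC):
    """Local stand-in for the vLLM base class (signatures are assumptions)."""

    @abstractmethod
    def init(self, config): ...

    @abstractmethod
    def start(self): ...

    @abstractmethod
    def update(self, step): ...

    @abstractmethod
    def finish(self): ...


class FakeReceiver:
    """Placeholder for MxV2RefitReceiver; `receive` is a hypothetical method."""

    def receive(self, step):
        return f"weights@{step}"


class MxWeightTransferEngine(WeightTransferEngine):
    """Thin adapter: lifecycle calls delegate to the wrapped receiver."""

    def __init__(self, receiver):
        self._receiver = receiver
        self.phases = []  # records the lifecycle order for inspection

    def init(self, config):
        self._config = dict(config)
        self.phases.append("init")

    def start(self):
        self.phases.append("start")

    def update(self, step):
        self.phases.append("update")
        return self._receiver.receive(step)

    def finish(self):
        self.phases.append("finish")


engine = MxWeightTransferEngine(FakeReceiver())
engine.init({"timeout_s": 1200})
engine.start()
result = engine.update(step=7)
engine.finish()
```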

Four immediate consequences captured in §8:

  §8.1 — Phase 2/3/4 should be repackaged as MxWeightTransferEngine for
         upstream contribution; the existing patches stay correct, the
         packaging just becomes upstream-native.

  §8.2 — The blog credits Matej Sirovatka specifically. He's likely
         mid-flight on a native-APIs rewrite of prime-rl's nixl_mx
         broadcast. Ask him before pushing Phase 2 upstream; the work
         may retarget to the adapter path directly.

  §8.3 — Their validation was at 16x 8xH200, DPEP32, 256 GPUs total. That
         scale makes Phase 4's multi-source slice planning load-bearing
         (mixed-TP/EP is the common case), not optional. Validates the
         design direction and sets the next cluster validation target
         after the DP=4 kavin smoke.

  §8.4 — pause_generation(mode="keep") + two-phase DPEP pause are
         features we don't yet match. Keep mode unlocks true async RL;
         queue after Phase 2 lands.

Updated follow-up list grows from 4 to 7 items, with the three new ones
being: write MxWeightTransferEngine, adopt keep-mode pause in the
orchestrator, and coordinate with Robert Shaw / the vLLM RL roadmap on
the K8s-native weight transfer engine they mention as ongoing work
(which describes MX itself, modulo who's driving the upstream PR).
…three docs

The three proposal docs now form a coherent set:

  - post-pr2389-status-and-plan.md  — executive summary; failure-class
                                       to fix mapping; mermaid diagram
                                       of the data + metadata planes;
                                       Phase 0 unblock guidance
  - post-pr2389-kernel-compile-plan.md — full RFC with phase-by-phase
                                          design rationale (unchanged
                                          except for cross-link header)
  - build-notes-2026-05-28.md        — operational findings from the
                                       source-baked image build, plus
                                       the vLLM native RL APIs reframe
                                       in section 8

Each doc now has a header block linking to the other two so readers
can navigate based on intent (status vs design vs operational).

The status-and-plan doc is the natural entry point for someone coming
to the work cold; the RFC and build-notes are the deep dives.